Retrieving Japanese specialized terms and corpora from the World Wide Web

Authors

  • Marco Baroni
  • Motoko Ueyama
Abstract

The BootCaT toolkit (Baroni and Bernardini, 2004) is a suite of Perl programs implementing a procedure to bootstrap specialized corpora and terms from the web using minimal knowledge sources. In this paper, we report ongoing work in which we apply the BootCaT procedure to a Japanese corpus and term extraction task in the hotel terminology domain. The results of our experiments are very encouraging, indicating that the BootCaT procedure can be successfully applied, with relatively small modifications, to a language very different from English and the other Indo-European languages on which we originally tested the procedure.

Similar resources

Building general- and special-purpose corpora by Web crawling

The Web is a potentially unlimited source of linguistic data; however, commercial search engines are not the best way for linguists to gather data from it. In this paper, we present a procedure to build language corpora by crawling and postprocessing Web data. We describe the construction of a very large Italian general-purpose Web corpus (almost 2 billion words) and a specialized Japanese “blo...

International Workshop Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning

TerminoWeb is a web-based platform designed to find and explore specialized domain knowledge on the Web. An important aspect of this exploration is the discovery of domain-specific collocations on the Web and their presentation in a concordancer to provide contextual information. Such information is valuable to a translator or a language learner presented with a source text containing a specifi...

Putting the “Wisdom of Crowds” to Use in NLP: Collaboratively Constructed Semantic Resources on the Web

Since the early 1990s, the Web has served as a unique corpus with background knowledge for various NLP tasks. The Web as a corpus has been employed in three principal ways: (i) obtaining Web-based frequencies for specific terms and constructions, (ii) collecting term-specific Web corpora by retrieving the corresponding text snippets, and finally (iii) constructing task and domain targeted corpora ...

Compilation of Specialized Comparable Corpora in French and Japanese

In this paper, we present the development of a tool for compiling specialized comparable corpora whose quality would be close to that of a manually compiled corpus. Comparability is based on three levels: domain, topic, and type of discourse. Domain and topic can be filtered with the keywords used in the web search, but detecting the type of discourse requires a broad linguistic analysis. The ...

An Intelligent Multilingual Information Browsing and Retrieval System Using Information Extraction

In this paper, we describe our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of information extraction (IE) technology in novel ways to improve the accuracy of cross-linguistic retrieval and to provide innovative methods for browsing a...

Publication date: 2004